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ABSTRACT 

Tfiere liave been a number of prior attempts to theoretically 
justify the effectiveness of the inverse document frequency 
(IDF). Those that take as their starting point Robertson 
and Sparck Jones's probabilistic model are based on strong 
or complex assumptions. We show that a more intuitively 
plausible assumption sufflces. Moreover, the new assump- 
tion, while conceptually very simple, provides a solution to 
an estimation problem that had been deemed intractable by 
Robertson and Walker (1997). 

Categories and Subject Descriptors: H.3.3 [Informa- 
tion Search and Retrieval]: Retrieval models 

General Terms: Theory, Algorithms 

Keywords: inverse document frequency, IDF, probabilistic 
model, term weighting 

1. INTRODUCTION 

The inverse document frequency (IDF) [12] has been "in- 
corporated in (probably) all information retrieval systems" 
([6], pg. 77). Attempts to theoretically explain its empirical 
successes abound ([lIlllIIIIIlElElllO, writer aha). Our 
focus here is on explanations based on Robertson and Sparck 
Jones's probabilistic-model (RSJ-PM) paradigm of informa- 
tion retrieval [lOj . not because of any prejudice against other 
paradigms, but because a certain RSJ-PM-based justifica- 
tion of the IDF in the absence of relevance information has 
been promulgated by several influential authors 2 , 9 , 7, . 

RSJ-PM-based accounts use either an assumption due to 
Croft and Harper [2] that is mathematically convenient but 
not plausible in real settings, or a complex assumption due 
to Robertson and Walker [TT]. We show that the IDF can be 
derived within the RSJ-PM framework via a new assump- 
tion that directly instantiates a highly intuitive notion, and 
that, while conceptually simple, solves an estimation prob- 
lem deemed intractable by Robertson and Walker [llj . 

2. CROFT-HARPER DERIVATION 

In the (binary-independence version of the) RSJ-PM, the 
i*'' term is assigned weight 

log —r, r ■ (1) 
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where p^ P{X, = l\R = y),q, P{X, = 1 | i? = n), 
Xi is an indicator variable for the presence of the i*'' term, 
and _R is a relevance random variable. Croft and Harper 
proposed the use of two assumptions to estimate pi and 
qi in the absence of relevance information. CH-1, which is 
unobjectionable, simply states that most of the documents 
in the corpus are not relevant to the query. This allows us 

to set gp^ , where rii is the number of documents in 

the corpus that contain the i*** term, and A'' is the num- 
ber of documents in the corpus. The second assumption, 
CH-2, is that all query terms share the same probability tt 
of occurring in a relevant documenfl Under CH-2, one sets 
pCn 4£f jj.^ gj^^ thus IT]) becomes 

TT -flog , (2) 

rii 

where tt' = log(7r/(l — tt)) is constant (and is if tt = .5). 
Quantity ((2]| is essentially the IDF. 

CH-2 is an ingenious device for pushing the derivation 
above through. However, intuition suggests that the oc- 
currence probability of query terms in relevant documents 
should be at least somewhat correlated with their occur- 
rence probability in arbitrary documents within the corpus, 
and hence not constant. For example, a very frequent term 
can be expected to occur in a noticeably large fraction of 
any particular subset of the corpus, including the relevant 
documents. Contrariwise, a query term might be relatively 
infrequent overall due to having a more commonly used syn- 
onym; such a term would still occur relatively infrequently 
even within the set of (truly) relevant documents|3 

3. ROBERTSON- WALKER DERIVATION 

Robertson and Walker (RW) [TT] also object to CH-2, on 
the grounds that for query terms with very large document 
frequencies, weight ([2]) can be negative. This anomaly, they 

show, arises precisely because pf^ is constant. They then 
propose the following alternative; 



where tt is the Croft-Harper constant, but reinterpreted as 
the estimate for pi just when m = 0. One can check that 
pf^ £ [tt, 1] slopes up hyperbolically in rii. Applying pf-'^ 

^This can be relaxed to apply to just the query terms in the 
document in question. 

^Indeed, one study [5] did find Pi increasing with rii. 



and q^^^ to the term-weight scheme ((TJ yields 

N 



Tv' + log ■ 



(3) 



(which is positive as long as tt > .5). 



4. NEW ASSUMPTION 

The estimate pf-'^ increases monotonically in rii , which is 
a desirable property, as we have argued above. However, its 
exact functional form does not seem particularly intuitive. 
RW motivate it simply as an approximation to a linear form; 
approximation is necessary, they claim, because 

the straight-line model [i.e., pi linear in Qi and 
hence Ui by CH-1] is actually rather intractable, 
and does not lead to a simple weighting formula 

([n, pg. 18)0 

Despite this claim, we show here that there exists a highly 
intuitive linear estimate that leads to a term weight varying 
inversely with document frequency. 

There are two main principles that motivate our new es- 
timate. First, as already stated, any estimate of pi should 
be positively correlated with Ui . The second and key insight 
is that query terms should have a higher occurrence proba- 
bility within relevant documents than within the document 
collection as a whole. Thus, if the i**^ term appears in the 
query, we should "lift" its estimated occurrence probability 
in relevant documents above Ui/N, which is its estimated 
occurrence probability in general documents. This leads us 
to the following intuitive estimate, which is reminiscent of 
"add-one smoothing" used in language modeling (more on 
this below): 



dof Tli + L 



(4) 



N + L 

Here the L > in the numeratoiQ is a "lift" or "boost" 
constantlfl Plugging pi and gp^ into yields the term 
weight 



log 



N+L p 



N+L 



log 1 + - 



which varies inversely in Ui, as desired. 

Furthermore, as hinted at above, selecting L's value is 
equivalent to selecting pi's value for query terms whose doc- 
ument frequency is 0. That is, L/{N -I- L) is directly analo- 
gous to TT in RW's derivation. Indeed, choosing L = A'' is just 
like choosing tt = 0.5, which is commonly done in presen- 
tations of the Croft-Harper derivation in order to eliminate 
the leading constant tt' in ([2]); doing so in our case yields 
the following term weight, which is the "usual" form of the 
IDF ([S], pg. 184): 



log 1 + 



N 



^The fact that RW's Figure 2 depicts the linear scenario 
graphically appears to have led to some mistaken impres- 
sions (e.g., '5 , pg. 18, coincidentally) that this is the math- 
ematical model that RW actually employed. 
*The L in the denominator ensures that Pi <1. 
^Since the RSJ-PM document-scoring function only accu- 
mulates weights for terms appearing in the query, it is fine 
to compute the pi's offline, that is, before the query is seen. 



Finally, note that pi is linear in Ui ; we have thus contradicted 
the assertion quoted above that developing a "straight-line" 
model is "intractable" |11) . 

5. ONWARD AND UPWARD 

An interesting direction for future work is to consider lift 
functions L{ni) that depend on rii. It can be shown that 
different choices of L{ni) allow one to model non-linear de- 
pendencies of Pi on Ui that occur in real data, such as the 
approximately logarithmic dependence observed in TREC 
corpora by Greiff [5] . Importantly, seemingly similar choices 
of L{ni) yield strikingly different term-weighting schemes; 
it would be interesting to empirically compare these new 
schemes against the classic IDF. 
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